# Vision-Language-Action Model
## Nora Long

declare-lab · Multimodal Fusion · Transformers · 673 downloads · 5 likes

A vision-language-action model trained on the Open X-Embodiment dataset that generates robot actions from language instructions and camera images.
## Pi0fast Base

lerobot · Apache-2.0 · Multimodal Fusion · 1,372 downloads · 12 likes

π0+FAST is a vision-language-action model from Physical Intelligence built on the FAST efficient action tokenization scheme, making it well suited to robotic vision-language-action tasks.
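As a rough starting point, the checkpoint can be pulled locally with huggingface_hub; the repo id `lerobot/pi0fast_base` is assumed from the listing above, and running the policy itself goes through the LeRobot library.

```python
# Sketch: fetch the pi0+FAST checkpoint with huggingface_hub.
# "lerobot/pi0fast_base" is an assumed repo id inferred from this listing;
# the policy is then loaded and executed through the LeRobot library itself.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="lerobot/pi0fast_base")
print("checkpoint downloaded to:", local_dir)
```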
## Jarvisvla Qwen2 VL 7B

CraftJarvis · MIT · Image-to-Text · Transformers · English · 163 downloads · 8 likes

A vision-language-action model designed specifically for Minecraft, capable of executing thousands of in-game skills from human language commands.
## Spatialvla 4b 224 Sft Fractal

IPEC-COMMUNITY · MIT · Text-to-Image · Transformers · English · 375 downloads · 0 likes

SpatialVLA is a vision-language-action model fine-tuned on the Fractal dataset, used primarily for robot control tasks.
## Spatialvla 4b 224 Sft Bridge

IPEC-COMMUNITY · MIT · Text-to-Image · Transformers · English · 1,066 downloads · 0 likes

A vision-language-action model fine-tuned from the SpatialVLA base model on the Bridge dataset, targeted at the Simpler-env benchmark.
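For either SpatialVLA checkpoint, a hedged loading sketch with the Transformers auto classes is shown below. The repo id is inferred from this listing, and the actual action-prediction call is defined by the repository's own custom modeling code, so this only covers loading.

```python
# Sketch: load a SpatialVLA checkpoint with the Transformers auto classes.
# The repo id is assumed from this listing; the repository ships custom
# modeling code, hence trust_remote_code=True. The action-prediction
# interface is defined by that custom code (see the model card).
import torch
from transformers import AutoModel, AutoProcessor

model_id = "IPEC-COMMUNITY/spatialvla-4b-224-sft-bridge"  # assumed repo id
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).eval()
```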
## Cogact Small

CogACT · MIT · Multimodal Fusion · Transformers · English · 405 downloads · 4 likes

The small variant of CogACT, a vision-language-action (VLA) architecture derived from vision-language models (VLMs) and designed for robot manipulation.
## Cogact Large

CogACT · MIT · Multimodal Fusion · Transformers · English · 122 downloads · 3 likes

The large variant of CogACT, a vision-language-action (VLA) architecture derived from vision-language models (VLMs) and designed for robot manipulation.
## Cogact Base

CogACT · MIT · Multimodal Fusion · Transformers · English · 6,589 downloads · 12 likes

CogACT is a vision-language-action (VLA) architecture that combines a vision-language model with a specialized action module for robotic manipulation tasks.
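To make the architecture description above concrete, here is a purely illustrative control loop showing how a VLA policy such as CogACT is typically used: an image and a language instruction go in, a low-level action command comes out, and the loop repeats at the control rate. Every function name below is a hypothetical stand-in, not part of the CogACT codebase.

```python
# Illustrative only: a generic VLA control loop with hypothetical stand-ins
# (get_camera_frame, vla_policy, send_to_robot). None of these names come
# from CogACT; consult the model card for its real API.
import numpy as np

def get_camera_frame() -> np.ndarray:
    # Stand-in for a camera driver; returns a dummy 224x224 RGB frame.
    return np.zeros((224, 224, 3), dtype=np.uint8)

def vla_policy(image: np.ndarray, instruction: str) -> np.ndarray:
    # Stand-in for the VLA model: maps (image, instruction) to a 7-DoF
    # action (end-effector delta pose + gripper), a common VLA convention.
    return np.zeros(7, dtype=np.float32)

def send_to_robot(action: np.ndarray) -> None:
    # Stand-in for a robot interface; here we just print the command.
    print("commanded action:", action)

instruction = "pick up the red block"
for _ in range(10):  # run a few steps of the closed-loop policy
    frame = get_camera_frame()
    action = vla_policy(frame, instruction)
    send_to_robot(action)
```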